Visakhapatnam
Small Vision-Language Models: A Survey on Compact Architectures and Techniques
Patnaik, Nitesh, Nayak, Navdeep, Agrawal, Himani Bansal, Khamaru, Moinak Chinmoy, Bal, Gourav, Panda, Saishree Smaranika, Raj, Rishi, Meena, Vishal, Vadlamani, Kartheek
The emergence of small vision-language models (sVLMs) marks a critical advancement in multimodal AI, enabling efficient processing of visual and textual data in resource-constrained environments. This survey offers a comprehensive exploration of sVLM development, presenting a taxonomy of architectures (transformer-based, Mamba-based, and hybrid) that highlights innovations in compact design and computational efficiency. Techniques such as knowledge distillation, lightweight attention mechanisms, and modality pre-fusion are discussed as enablers of high performance with reduced resource requirements. Through an in-depth analysis of models like TinyGPT-V, MiniGPT-4, and VL-Mamba, we identify trade-offs between accuracy, efficiency, and scalability. Persistent challenges, including data biases and generalization to complex tasks, are critically examined, with proposed pathways for addressing them. By consolidating advancements in sVLMs, this work underscores their transformative potential for accessible AI, setting a foundation for future research into efficient multimodal systems.
Zipfian Whitening
Yokoi, Sho, Bao, Han, Kurita, Hiroto, Shimodaira, Hidetoshi
The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
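The frequency-weighted PCA whitening described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `zipfian_whitening`, the `eps` stabilizer, and the input layout (one row per word) are choices made here for illustration. The only change from ordinary whitening is that the mean and covariance are expectations under the empirical word distribution rather than uniform averages over the vocabulary.

```python
import numpy as np

def zipfian_whitening(embeddings, freqs, eps=1e-12):
    """Whiten word vectors with statistics weighted by word frequency.

    embeddings: array of shape (vocab, dim), one row per word (assumed layout).
    freqs: nonnegative counts or probabilities, one per word.
    """
    W = np.asarray(embeddings, dtype=float)
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()                               # normalize to a probability distribution

    mu = p @ W                                    # frequency-weighted mean vector
    centered = W - mu
    cov = (centered * p[:, None]).T @ centered    # frequency-weighted covariance

    eigvals, eigvecs = np.linalg.eigh(cov)        # PCA of the weighted covariance
    transform = eigvecs / np.sqrt(eigvals + eps)  # rescale each principal direction
    return centered @ transform
```

After the transform, the embeddings have zero mean and identity covariance when both are measured under the word-frequency distribution. Passing uniform `freqs` recovers ordinary PCA whitening.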
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Zhou, Yujun, Yang, Jingdong, Guo, Kehan, Chen, Pin-Yu, Gao, Tian, Geyer, Werner, Moniz, Nuno, Chawla, Nitesh V, Zhang, Xiangliang
Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advancements in safety training, laboratory personnel may still unknowingly engage in unsafe practices. With the increasing reliance on large language models (LLMs) for guidance in various fields, including laboratory settings, there is a growing concern about their reliability in critical safety-related decision-making. Unlike trained human researchers, LLMs lack formal lab safety education, raising questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness primarily focuses on issues such as ethical compliance, truthfulness, and fairness but fails to fully cover safety-critical real-world applications, like lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. This benchmark includes 765 multiple-choice questions verified by human experts, assessing the performance of LLMs and vision-language models (VLMs) in lab safety contexts. Our evaluations demonstrate that while GPT-4o outperforms human participants, it is still prone to critical errors, highlighting the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.
EB-NeRD: A Large-Scale Dataset for News Recommendation
Kruse, Johannes, Lindskow, Kasper, Kalloori, Saikishore, Polignano, Marco, Pomo, Claudio, Srivastava, Abhishek, Uppal, Anshuk, Andersen, Michael Riis, Frellsen, Jes
Personalized content recommendations have been pivotal to the content experience in digital media, from video streaming to social networks. However, several domain-specific challenges have held back the adoption of recommender systems in news publishing. To address these challenges, we introduce the Ekstra Bladet News Recommendation Dataset (EB-NeRD). The dataset encompasses data from over a million unique users and more than 37 million impression logs from Ekstra Bladet. It also includes a collection of over 125,000 Danish news articles, complete with titles, abstracts, bodies, and metadata, such as categories. EB-NeRD served as the benchmark dataset for the RecSys '24 Challenge, where it was demonstrated how the dataset can be used to address both technical and normative challenges in designing effective and responsible recommender systems for news publishing. The dataset is available at: https://recsys.eb.dk.
RecSys Challenge 2024: Balancing Accuracy and Editorial Values in News Recommendations
Kruse, Johannes, Lindskow, Kasper, Kalloori, Saikishore, Polignano, Marco, Pomo, Claudio, Srivastava, Abhishek, Uppal, Anshuk, Andersen, Michael Riis, Frellsen, Jes
The RecSys Challenge 2024 aims to advance news recommendation by addressing both the technical and normative challenges inherent in designing effective and responsible recommender systems for news publishing. This paper describes the challenge, including its objectives, problem setting, and the dataset provided by the Danish news publishers Ekstra Bladet and JP/Politikens Media Group ("Ekstra Bladet"). The challenge explores the unique aspects of news recommendation, such as modeling user preferences based on behavior, accounting for the influence of the news agenda on user interests, and managing the rapid decay of news items. Additionally, the challenge embraces normative complexities, investigating the effects of recommender systems on news flow and their alignment with editorial values. We summarize the challenge setup, dataset characteristics, and evaluation metrics. Finally, we announce the winners and highlight their contributions. The dataset is available at: https://recsys.eb.dk.
Multi-Agent Obstacle Avoidance using Velocity Obstacles and Control Barrier Functions
Roncero, Alejandro Sánchez, Muchacho, Rafael I. Cabral, Ögren, Petter
Velocity Obstacle (VO) methods form a paradigm for collision avoidance among moving obstacles and agents. While VO methods perform well in simple multi-agent environments, they do not guarantee safety and can behave overly conservatively in common situations. In this paper, we propose to combine a VO strategy for guidance with a control barrier function (CBF) approach for safety, which overcomes the overly conservative behavior of VOs and formally guarantees safety. We validate our method in a baseline comparison study, using second-order integrator and car-like dynamics. Results support that our method outperforms the baselines with respect to path smoothness, collision avoidance, and success rates.
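The two ingredients named in the abstract above can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the paper: disc-shaped agents, constant velocities over an unbounded time horizon for the VO test, and a single-integrator agent with one static obstacle for the CBF filter (the paper itself uses second-order integrator and car-like dynamics). The function names and the gain `alpha` are hypothetical.

```python
import numpy as np

def in_velocity_obstacle(p_a, v_a, p_b, v_b, radius_sum):
    """Return True if agent A's velocity lies in the velocity obstacle of B,
    i.e. the two discs collide at some future time if both keep their velocities."""
    r = np.asarray(p_b, dtype=float) - np.asarray(p_a, dtype=float)  # relative position
    v = np.asarray(v_a, dtype=float) - np.asarray(v_b, dtype=float)  # relative velocity
    if np.linalg.norm(r) < radius_sum:
        return True                                # already overlapping
    speed_sq = float(v @ v)
    if speed_sq == 0.0:
        return False                               # no relative motion, distance stays fixed
    t_closest = max(0.0, float(r @ v) / speed_sq)  # time of closest approach (future only)
    return bool(np.linalg.norm(r - t_closest * v) < radius_sum)

def cbf_filter(x, u_nom, x_obs, radius, alpha=1.0):
    """Minimally modify u_nom so that h(x) = |x - x_obs|^2 - radius^2 satisfies
    dh/dt + alpha * h >= 0 for single-integrator dynamics x' = u.
    Assumes x != x_obs (the constraint gradient vanishes at the obstacle center)."""
    x, u_nom, x_obs = (np.asarray(a, dtype=float) for a in (x, u_nom, x_obs))
    d = x - x_obs
    h = float(d @ d) - radius ** 2
    a = 2.0 * d                           # gradient of h, so dh/dt = a . u
    slack = float(a @ u_nom) + alpha * h  # constraint reads: slack >= 0
    if slack >= 0.0:
        return u_nom                      # nominal input is already safe
    # Closed-form solution of the one-constraint QP: project onto the boundary.
    return u_nom - slack * a / float(a @ a)
```

In a combined scheme, a VO-style planner would propose the nominal velocity `u_nom`, and the CBF filter would change it only when the safety constraint is about to be violated.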
Modeling Urban Transport Choices: Incorporating Sociocultural Aspects
Salazar-Serna, Kathleen, Cadavid, Lorena, Franco, Carlos J.
By understanding how users decide on their commuting modes, it is possible to identify factors that can be influenced to change travel behavior and promote the adoption of more sustainable transportation modes. Agent-based modeling (ABM) is particularly valuable for this purpose, as it can represent complex systems like transportation and identify emerging collective behaviors resulting from the autonomous decisions of transport users interacting with one another and with the environment (Kagho, Balac, and Axhausen 2020). These capabilities make ABM suitable for analyzing the impacts of transport policies (Wise, Crooks, and Batty 2017). However, the application of ABM in analyzing transport mode choices has been limited, and studies have been conducted predominantly in developed countries (Cadavid and Salazar-Serna 2021; Salazar-Serna, Cadavid, Franco, and Carley 2023). Their findings may not extend seamlessly to developing regions due to different contextual policy needs and the distinct ways socioeconomic and cultural factors influence human behavior (Carley 1991; Salazar-Serna et al. 2023). Therefore, policies that have been successful in one setting might not achieve similar outcomes in another. Previous studies in transportation have identified various determinants affecting mode choice. These factors can be grouped into several categories: sociodemographic characteristics such as age, sex, occupation, and income level (Ashalatha et al. 2013); travel habits including distance traveled, travel time, origin-destination pairs, and trip purpose (Madhuwanthi et al. 2016); and attributes of the built environment like design, density, and capacity (Ewing and Cervero 2010). Additionally, attitudes and perceptions regarding transport modes, which cover aspects such as comfort, cost, security, safety, quality, and reliability, play a crucial role (Fu 2021).
Accelerating Drug Safety Assessment using Bidirectional-LSTM for SMILES Data
Rao, K. Venkateswara, Rao, Kunjam Nageswara, Ratnam, G. Sita
Computational methods are useful in accelerating the pace of drug discovery. Drug discovery comprises several steps, such as target identification and validation, lead discovery, and lead optimisation. In the lead optimisation phase, the absorption, distribution, metabolism, excretion, and toxicity properties of lead compounds are assessed. This work addresses the prediction of toxicity and solubility for lead compounds represented in Simplified Molecular Input Line Entry System (SMILES) notation. Among the different approaches that work on SMILES data, the proposed model was built using a sequence-based approach. The proposed Bi-Directional Long Short-Term Memory (BiLSTM) model is a variant of the Recurrent Neural Network (RNN) that processes input molecular sequences in both the forward and backward directions for a comprehensive examination of the structural features of molecules. The proposed work aims to understand the sequential patterns encoded in the SMILES strings, which are then utilised for predicting the toxicity of the molecules. On the ClinTox dataset, the proposed model surpasses previous approaches such as TrimNet and pre-trained graph neural networks (GNNs), achieving a ROC accuracy of 0.96. On the FreeSolv dataset, the BiLSTM outperforms the previous model in solubility prediction with a low RMSE value of 1.22.
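A sequence-based BiLSTM over SMILES strings, as described in the abstract above, might look like the following PyTorch sketch. It is an illustration under assumptions, not the authors' architecture: the character-level tokenization (which splits two-letter atoms such as `Cl`), the layer sizes, and the names `SmilesBiLSTM`, `build_vocab`, and `encode_smiles` are all hypothetical. For brevity, padded positions are not masked, so the final forward state of a short sequence also reflects padding.

```python
import torch
import torch.nn as nn

def build_vocab(smiles_list):
    """Map each character to an integer id; id 0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_smiles(smiles_list, vocab):
    """Turn SMILES strings into a zero-padded LongTensor of shape (batch, max_len)."""
    max_len = max(len(s) for s in smiles_list)
    ids = torch.zeros(len(smiles_list), max_len, dtype=torch.long)
    for i, s in enumerate(smiles_list):
        ids[i, :len(s)] = torch.tensor([vocab[ch] for ch in s])
    return ids

class SmilesBiLSTM(nn.Module):
    """Embed tokens, read them in both directions, predict from the final states."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_outputs=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_outputs)

    def forward(self, token_ids):
        x = self.embed(token_ids)                     # (batch, length, embed_dim)
        _, (h_n, _) = self.lstm(x)                    # h_n: (2, batch, hidden_dim)
        summary = torch.cat([h_n[0], h_n[1]], dim=1)  # forward and backward final states
        return self.head(summary)                     # logit (toxicity) or value (solubility)
```

The same network can serve both tasks by changing only the loss: a binary cross-entropy loss on the output logit for toxicity classification, or a mean squared error loss for solubility regression.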